Using Vector Embeddings For Sentiment Analysis

Rod Acosta, Kevin Furbish, Ibrahim Khan, Anthony Washington

1. Introduction

Sentiment Analysis (SA) is a branch of Natural Language Processing (NLP) that computationally extracts the “treatment of opinions, sentiments and subjectivity” (Medhat, Hassan, and Korashy 2014) of digital text.

1.1 Sentiment Analysis

Sentiment analysis, a crucial facet of natural language processing (NLP), involves interpreting and classifying emotions within text data. By understanding public opinion, businesses can refine marketing strategies, improve products, and enhance customer satisfaction. The rapid growth of social media and e-commerce platforms has amplified the importance of sentiment analysis, enabling real-time feedback and insights into consumer behavior.

The study by Hasan, Maliha, and Arifuzzaman (2019) demonstrates the potential of Twitter data for sentiment analysis, highlighting the application of NLP frameworks to gauge public opinion on products. Their methodology integrates the Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) models, coupled with logistic regression, to classify tweets as positive or negative with an accuracy of 85.25%. This approach underscores the benefits of combining BoW and TF-IDF to enhance sentiment analysis precision. (Hasan, Maliha, and Arifuzzaman 2019)

In a similar vein, Kathuria, Sethi, and Negi (2022) explored sentiment analysis on e-commerce reviews, employing machine learning (ML) models such as logistic regression, AdaBoost, SVM, Naive Bayes, and random forest. Utilizing the Women’s E-commerce Clothing Reviews dataset, they analyzed review texts, ratings, and recommendations to understand consumer behavior. Their findings underscore the significance of electronic word-of-mouth (eWOM) in shaping customer attitudes and product sales, providing e-commerce businesses with actionable insights to improve marketing strategies and customer satisfaction. (Kathuria, Sethi, and Negi 2022)

The foundational work by Nasukawa and Yi (2003) emphasizes a more granular approach to sentiment analysis, focusing on extracting sentiments linked to specific subjects rather than entire documents. Their prototype system, employing a syntactic parser and a sentiment lexicon, achieves high precision in detecting sentiments in web pages and news articles. By concentrating on local sentiment expressions, their method offers detailed insights into specific opinions, aiding businesses in monitoring public opinion and addressing unfavorable sentiments effectively. (Nasukawa and Yi 2003)

Collectively, these studies highlight the evolution and application of sentiment analysis using NLP, illustrating its critical role in extracting valuable insights from vast amounts of text data. By leveraging advanced NLP techniques and ML models, businesses can gain a deeper understanding of consumer sentiment, thereby enhancing their strategic decision-making processes.

1.2 Vector Embeddings In Natural Language Processing

One technique for encoding or describing the sentiment of a word or group of words is to use vector embeddings (or embeddings). Before discussing the use of embeddings for SA, it’s important to first understand what embeddings are. Embeddings are said to be “one of the most important topics of interest” in NLP in the last decade (Camacho-Collados and Pilehvar 2020).

The use of word embeddings in NLP is an improvement over representing words as an index into a vocabulary since embeddings can encode relationships or similarity of words. For example, when using a simple index into a vocabulary to represent a word, “boy” would have one index and “man” would have another index, but there is no indication that these two words are related. Vocabulary indexes also fail to represent that a word may have multiple meanings. For example, “mouse” would be one entry in a vocabulary, but would fail to indicate if the word refers to an animal, or computer input device (Camacho-Collados and Pilehvar 2020).

Word embedding models can capture fairly detailed semantic and syntactic patterns, however how these patterns are encoded in a vector is often unclear (Pennington, Socher, and Manning 2014). Unlike indexes into a vocabulary, word embeddings have the advantage of being able to be compared for similarity. Comparison is often done either by measuring distance between vectors, or the angle between vectors (Pennington, Socher, and Manning 2014). (Mikolov et al. 2013) discovered in their famous “Word2Vec” model that simple vector arithmetic over their embedding model allowed evaluation of analogies. For example, \(vector(king) - vector(man) + vector (woman)\) resulted in a vector that was closest to the vector for “queen”.

Embedding models also exist for more complicated NLP problems than word representations. Sentence embedding models can be built on top of word embeddings models. This is an improvement over using “bag of words” representations for sections of natural language longer than a single word. Bag of word models often use vectors of one-hot encoding, however these representations have very high dimensionality and sparseness. These issues led researchers to look for alternatives and they found embeddings to be a valuable technique (Pilehvar and Camacho-Collados 2020). Sentence embeddings compose word embeddings for each of the words of a sentence into a single vector that encodes semantic and syntactic properties of the sentence and can be compared for similarity across those properties (Kiros et al. 2015). As with using bag of words to encode sentences, bag of word models are similarly poor representations of documents for many NLP tasks. In such cases, encoding the document into a document embedding is a useful alternative representation (Pilehvar and Camacho-Collados 2020). Since bag of word models ignore word ordering, they are a problematic representation for many NLP tasks, including sentiment analysis (Pilehvar and Camacho-Collados 2020), which will be the focus for the rest of this work.

1.3 Using Embeddings For Sentiment Analysis

Now that we have touched on the topics of sentiment analysis and embeddings, let’s dive into the use of embeddings for sentiment analysis. The performance of natural language processing (NLP) tasks including question answering & sentiment analysis, has significantly improved due to word embedding and deep learning models over the years (Kasri et al. 2022). Ultimately our goal when converting text into embeddings is to be able to extract the meaning of the words in the text and allow our model to learn from it. There are two very popular word embedding methods, of which we have already touched upon slightly, the two methods are Word2Vec and Global Vectors (GloVe). Using one of these two methods along with an embedding model we will be able to categorize data into positive, negative, or neutral sentiments by evaluating the overall sentiment of the input data. This branch of NLP has been increasingly becoming more and more popular, specially for processing and gathering the sentiment of posts on social media sites, forums, an the web in general.

Our two popular pretrained word methods were each trained on different datasets. Word2Vec was trained on a portion of the Google News dataset with 300-dimensional vectors and GloVe was trained on 6 billion tokens, 400k vocabulary words, and 100-dimensional vectors from Wikipedia 2014 (N et al. 2024). The drawback to most embedding based models is that they typically only utilize pre-trained word embeddings that can not capture the effective information in the text or input, such as the contextual sentiment information of both targets and aspects (Liang et al. 2023). This is problematic because this leads words with similar contexts but opposite sentiment (good and bad for example) to be mapped to neighboring word vectors making our models less accurate(Tang et al. 2016). We will be tackling this issue more in depth in this paper and exploring other forms of embeddings for sentiment analysis and models along the way!

Methods

What is a neural Network?

  • A neural network is a type of algorithm that mimics the structure and function of the human brain. Their goal is to create an artificial system that can process and analyze data in a similar way.
  • There are different types of neural networks but there are some common elements between most of them. Those elements are:
    • Artificial Neurons
    • Layers

Neural Network Layers

  • Neural networks usually have three types of layers:
    • Input Layer
    • Hidden layers
    • Output layer

What are embeddings?

  • Embeddings are a technique that allow us to map words or phrases into a corresponding vector of real numbers, where the position and direction of the vector capture the word’s semantic meaning in relation to other words.
  • They make high-dimensional data like words readable to our algorithm/model and allows our model to recognize and learn meaningful relationships and similarities between words

Dense Layer & Cosine Similarity

  • Cosine Similarity
    • Measures the cosine of the angle between two non-zero vectors, providing a measure of similarity.
    • The smaller the angle the higher the similarity between the two vectors.
    • \(cosine\_similarity(u,v) = \frac{u.v}{||u|| ||v||}\)
  • Dense Layer
    • A logistic regression model with a sigmoid activation function used for binary classification.
    • It outputs the probability that the input belongs to a positive class.
    • \(y=\sigma(W⋅z+b)\)
    • Where:
      • z is the flattened input vector.
      • W is the weight vector.
      • b is the bias term.
      • \(\sigma(x) = \frac{1}{1+e^{-x}}\) is the sigmoid function.

Sentiment Analysis

  • Through the use of a neural network and it’s hidden layers (embedding & dense), and the cosine similarity we are able to take inputs and classify them as being part of a positive or negative class based on what our model has learned from our training dataset.

3. Analysis and Results

3.1 Dataset Description

For sentiment analysis, this project will use a data set of movie reviews from IMDB (Maas et al. 2011). The dataset includes 25,000 movie reviews, and since some movies are reviewed more often than others, the dataset includes a maximum of 30 reviews for any particular movie. The dataset includes only the top 5,000 most frequent words, however the top 50 most frequent words are also discarded as they are unlikely to contribute much to sentiment context. IMDB reviews include a star rating of 1 to 10 stars, and these ratings have been converted to a 0-1 scale for use as a sentiment classification label in the dataset.

The TensorFlow package (Abadi et al. 2015) includes this dataset in a vectorized format which is ideal for use in neural networks. The vectorization process starts by assigning each word that appears in the vocabularly (i.e. all the unique words in the dataset) with a unique number for a numeric substitution value. Then each word in the original text observation is replaced with the substitution number assigned to that word. Every observation is translated in this same way to convert from a string made up of words to a vector of integers. An example follows below:

Observation #1: “this is fun”

Observation #2: “fun times ahead”

Observation #3: “fun is ahead of times”

Based on the three above observations, there are six unique words in the vocabularly. These six unique words would each be assigned a numeric value. Thus the vocabulary list would be [1-this, 2-is, 3-fun, 4-times, 5-ahead, 6-of]. To vectorize the strings, the numeric values for each of those words are added to a integer vector. The vectorized observations are shown below.

Observation #1 vectorized: [1,2,3]

Observation #2 vectorized: [3,4,5]

Observation #3 vectorized: [3,2,5,6,4]

Install and Import Packages

Loading and preparing Training and Test Data

Load the tensorflow IMDB review dataset. Only the top most common 5000 words will be included. All other words will be replaced with a token representing an unknown word. Up to the first 500 words in a review are included in the training and test sets.

The neural network will be expecting batches of training examples that are 500 words long, so pad any observations shorter than 500 words with a token representing the padding word to get to the required 500 word length.

Statistical Modeling

In this section a neural network model will be implemented using the keras package to learn embeddings for each of the words in the vocabulary. The goal is to learn multi-dimensional vectors where similar words are close in the vector-space, where similar means having a similar contextual meaning with regards to the training dataset and its sentiment classification. For example, “gem” and “favorite” would be highly similar in the context of a movie review, whereas in a general context they would not be so similar.

The number of dimensions of the output embedding will be varied and tested as part of the modeling process. The number of dimensions is an important hyperparameter since it will control how much compression of the training set occurs. A small number of dimensions results in a higher amount of compression, whereas a large number of dimensions allows for more detail to be captured by the embeddings. However, a larger number of dimensions can also lead to overfitting (Yin and Shen 2018).

Next the embedding model is trained. (Chollet and Allaire 2018) and (Monroe, n.d.) were important resources in coding the embedding training. As a first step, the model will be trained repeatedly with a different number of dimensions each time. Models will be trained using from 2 to 7 dimensions, and the testing accuracy will be recorded for each model.

The neural network model uses an embedding layer that will convert the words in the vocabulary to a multi-dimensional vector embedding once trained. The number of inputs to the embedding layer is 5000, which corresponds to the number of words in the vocabulary. The selected number of outputs for the embedding layer is the dimensionality of the embedding vector. As mentioned previously, this dimensionality will be varried to test the performance of the embeddings across different sizes of embedding dimensions. A second layer in the neural network model flattens the 3 dimensional tensor output from the embedding layer to a 2 dimensional tensor. Finally, a dense layer connects every output from the flatten layer to the final output layer. The model is trained using back propagation to predict the sentiment classification variable, and the final trained weights of the embedding layer are the embeddings for each corresponding word in the vocabulary.

782/782 - 0s - 588us/step - acc: 0.8792 - loss: 0.2854
782/782 - 0s - 608us/step - acc: 0.8782 - loss: 0.2879
782/782 - 0s - 573us/step - acc: 0.8767 - loss: 0.2940
782/782 - 0s - 579us/step - acc: 0.8761 - loss: 0.2967
782/782 - 0s - 573us/step - acc: 0.8767 - loss: 0.2974
782/782 - 1s - 668us/step - acc: 0.8743 - loss: 0.3038
      2       3       4       5       6       7 
0.87924 0.87820 0.87672 0.87608 0.87672 0.87432 

Surprisingly, an embedding of just 2 dimensions had the best accuracy. That may be the highest accuracy in predicting the binary sentiment classification, but the question is does that over compress the data and fail to represent the higher order patterns we hope the embedding models? As (Yin and Shen 2018) points out, “the impact of dimensionality on word embedding has not yet been fully understood…a word embedding with a small dimensionality is typically not expressive enough to capture all possible word relations, whereas one with a very large dimensionality suffers from over-fitting.”

The model with just 2 dimensions is tested to see how well it does on finding similar words, where similar is in the context of the sentiment of a movie review.

782/782 - 0s - 590us/step - acc: 0.8805 - loss: 0.2855

The words “awful”, “mediocre”, “perfect” and “favorite” are some positive and negative words that could be found in a movie review. These test words are ysed to qualitatively test the embedding model by examining which words are found to be close to the test words.

    awful      lame alcoholic     sadly  relevant       are 
1.0000000 1.0000000 1.0000000 0.9999998 0.9999998 0.9999998 
  mediocre     effort     turkey   terrible stereotype     repeat 
 1.0000000  1.0000000  1.0000000  0.9999999  0.9999999  0.9999998 
  perfect    lovers      sing   manager      bath    donald 
1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9999999 
 favorite    deeply     round     marie  polanski    poetry 
1.0000000 1.0000000 1.0000000 1.0000000 0.9999999 0.9999999 

Some related words are found, but there are some other words that don’t seem to be very related. Overall the results don’t appear very good, so it seems embeddings using only 2 dimensions is not adequate despite the high accuracy found on the test set.

(Yin and Shen 2018) states that selecting the number of dimensions is often done ad hoc or by using grid search, with a common method being to train embeddings of different dimensions and evaluate them models using a functionality test like word analogy. A similar method was used here on a smaller scale where the embedding model was retrained using a larger number of dimensions and the performance of the related words test was compared. There was insufficient time to test many model variations, but it was important to test a larger number of dimensions to compare to the 2 dimension model. The test accuracy for the model previously trained with 7 dimensions was fairly close to the accuracy for 2 dimensions, so that embedding length was tested next.

782/782 - 0s - 571us/step - acc: 0.8736 - loss: 0.3002

Here are the similar words for the same positive and negative words that were previously tested, but now tested using the new embedding model with higher dimensionality:

     awful ultimately    painful      sorry       fake    nowhere 
 1.0000000  0.9979454  0.9972085  0.9959082  0.9958690  0.9951919 
     mediocre         teeth   incompetent          main disappointing 
    1.0000000     0.9969511     0.9960854     0.9958095     0.9956890 
     generous 
    0.9951328 
   perfect      great    seeking    freedom tremendous  excellent 
 1.0000000  0.9983013  0.9981593  0.9978434  0.9972070  0.9955213 
 favorite    paulie excellent necessary     great   seeking 
1.0000000 0.9986090 0.9984505 0.9983542 0.9971755 0.9971190 

These results are better than the 2 dimension model, so it seems test accuracy isn’t a good method to determine how many dimensions should be included in the embedding model.

Next, the number of epochs used in training will be evaluated to see how that impacts the model performance.

782/782 - 0s - 574us/step - acc: 0.8665 - loss: 0.3540

Training Metrics This keras graph of the accuracy of the training data (blue) vs. the testing data (green) shows that the testing accuracy starts to flatten at epoch 6, so it appears 6 epochs is effective. This is the number of epochs previously used in training, so the best model remains 7 dimensions trained with 6 epochs.

Data and Visualization

Conclusion

The development and analysis of the word embedding model for classifying IMDB movie reviews demonstrated promising results. The optimal number of embedding dimensions was identified as 7, achieving an accuracy of 87.34% on the test dataset. This was determined through extensive experimentation, revealing that higher dimensions, such as 7, provided competitive and consistent accuracy. The model’s performance is noteworthy, given the constraints of training on only the top 5000 most common words, minimal data preprocessing, and limiting input sequences to the first 500 words of each review. These factors illustrate the model’s robustness and effectiveness in capturing the semantic relationships within the data.

Furthermore, the embedding similarity results showed that the model could meaningfully capture semantic relationships, as evidenced by the coherent and relevant similar words found for terms like “awful,” “mediocre,” “perfect,” and “favorite.” The final training session, capped at 10 epochs, ensured the model did not overfit, maintaining its accuracy and reliability. Overall, the model’s strong performance under constrained conditions highlights its potential for practical applications in sentiment analysis, offering an efficient and effective solution for understanding and categorizing movie reviews.

Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.” https://www.tensorflow.org/.
Camacho-Collados, Jose, and Mohammad Taher Pilehvar. 2020. “Embeddings in Natural Language Processing.” In Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts, edited by Lucia Specia and Daniel Beck, 10–15. Barcelona, Spain (Online): International Committee for Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-tutorials.2.
Chollet, Francois, and J. J. Allaire. 2018. Deep Learning with r. 1st ed. USA: Manning Publications Co.
Hasan, Md. Rakibul, Maisha Maliha, and M. Arifuzzaman. 2019. “Sentiment Analysis with NLP on Twitter Data.” In 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), 1–4. https://doi.org/10.1109/IC4ME247184.2019.9036670.
Kasri, Mohammed, Marouane Birjali, Mohamed Nabil, Abderrahim Beni-Hssane, Anas El-Ansari, and Mohamed El Fissaoui. 2022. “Refining Word Embeddings with Sentiment Information for Sentiment Analysis.” Journal of ICT Standardization 10 (3): 353–82. https://doi.org/10.13052/jicts2245-800X.1031.
Kathuria, Priyanshi, Parth Sethi, and Rithwick Negi. 2022. “Sentiment Analysis on e-Commerce Reviews and Ratings Using ML & NLP Models to Understand Consumer Behavior.” In 2022 International Conference on Recent Trends in Microelectronics, Automation, Computing and Communications Systems (ICMACC), 1–5. https://doi.org/10.1109/ICMACC54824.2022.10093674.
Kiros, Ryan, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. “Skip-Thought Vectors.” Advances in Neural Information Processing Systems 28.
Liang, Bin, Rongdi Yin, Jiachen Du, Lin Gui, Yulan He, Min Yang, and Ruifeng Xu. 2023. “Embedding Refinement Framework for Targeted Aspect-Based Sentiment Analysis.” IEEE Transactions on Affective Computing 14 (1): 279–93. https://doi.org/10.1109/TAFFC.2021.3071388.
Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. “Learning Word Vectors for Sentiment Analysis.” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142–50. Portland, Oregon, USA: Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1015.
Medhat, Walaa, Ahmed Hassan, and Hoda Korashy. 2014. “Sentiment Analysis Algorithms and Applications: A Survey.” Ain Shams Engineering Journal 5 (4): 1093–113. https://doi.org/10.1016/j.asej.2014.04.011.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781.
Monroe, Burt. n.d. “Materials for Classes: ‘Text as Data’ (PLSC 597) at Penn State & ‘Advanced Text as Data: Natural Language Processing’ (2p) at Essex Summer School in Social Science Data Analysis.” TextAsDataCourse. https://burtmonroe.github.io/TextAsDataCourse/.
N, Lavanya B., Anitha Rathnam K. V, Kiran K, P. Deepa Shenoy, and Venugopal K. R. 2024. “Fusion of Deep Learning with Advanced and Traditional Embeddings in Sentiment Analysis.” In 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), 1–6. https://doi.org/10.1109/I2CT61223.2024.10543279.
Nasukawa, Tetsuya, and Jeonghee Yi. 2003. “Sentiment Analysis: Capturing Favorability Using Natural Language Processing.” In Proceedings of the 2nd International Conference on Knowledge Capture, 70–77.
Pennington, Jeffrey, Richard Socher, and Christopher D Manning. 2014. “Glove: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43.
Pilehvar, Mohammad Taher, and Jose Camacho-Collados. 2020. Embeddings in Natural Language Processing: Theory and Advances in Vector Representations of Meaning. Morgan & Claypool Publishers.
Tang, Duyu, Furu Wei, Bing Qin, Nan Yang, Ting Liu, and Ming Zhou. 2016. “Sentiment Embeddings with Applications to Sentiment Analysis.” IEEE Transactions on Knowledge and Data Engineering 28 (2): 496–509. https://doi.org/10.1109/TKDE.2015.2489653.
Yin, Zi, and Yuanyuan Shen. 2018. “On the Dimensionality of Word Embedding.” Advances in Neural Information Processing Systems 31.